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This article is the next article from a series of articles on Parquet. You should check 
the previous Parquet article before reading this one if you don’t have any Parquet 
knowledge, but it is also a great reminder for people more advanced. If you want to 


reproduce the input data for this article, the code is at the end. 


Apache Parquet is a columnar storage format for big data frameworks, such as 
Apache Hadoop and Apache Spark. It is designed to improve the performance 
of big data processing by using a columnar storage format, which stores data 


in a compressed and efficient way. 


Parquet adoption continues to increase as more and more organizations turn 
to big data technologies to process and analyze large datasets. With this 
continuous development, it is important that everyone learns some best 


practices and how to navigate through Parquet files. 


In this tutorial, we will show you how to gain maximum insight into your 
Parquet data as a Parquet user without resorting to the common brute force of 


loading it for understanding. 


The case study 


To do this, we provide you with a case study in which a Data Engineer has 
provided you with loan applicants' data and you need to create predictive 
models with that data. But first, you need to “technically” discover the data. 
And it is huge data. 


Indeed, the Data Engineer that prepared the data, tells you that the Parquet 
folder is 1TB large (for educational purposes only, this is not the case in our 
example), so if you try to load everything, you will run through Memory Error 


on your machine. 


Don't worry, we will provide you with the most efficient way to understand 


the big Parquet data without even loading the Parquet data in memory. 
That means answering the questions below: 


e What do the Parquet files in this folder look like? 
e What variables are inside? With what types? Some statistics? 


e how is the data partitioned? 


We will also teach you how to reformat partitions if you notice that 


something is wrong with the way the data has been partitioned. 


Reading the first Parquet file 


The import that you will need for this tutorial: 


import pyarrow as pa 
import pyarrow.parquet as pq 
import os 


First of all, we want to get an idea of what the folder 
‘APPLICATIONS_PARTITIONED’ contains, this is where the data is stored. 


Since you don’t know how the data is partitioned, you cannot just load the 
whole folder blindly because you'll be loading all the Parquet files and that’s 
not what you want to do (remember 1TB size), but rather you want to get an 


overview of your data. 


Here, I give youa function get_first_parquet_from_path() that will return the 
first Parquet file that is in a directory. The function will scan through each 
directory and subdirectory until it finds a Parquet file and will return the 


complete path of this single file. 


def get_first_parquet_from_path(path) : 
for (dir_path, _, files) in os.walk(path) : 
for f in files: 
if f.endswith(". parquet"): 


first_pq_path = os.path.join(dir_path, f) 
return first_pq_path 


Looks like a cool function, let’s put it into practice. 


path = 'APPLICATIONS_PARTITIONED' 

first_pq = get_first_parquet_from_path (path) 

first_pq 

#Output : APPLICATIONS PARTITIONED/NAME_INCOME_TYPE=Commercial associate/CODE_GE 


< CD > 


We notice from the path that this is partitioned by NAME_INCOME_TYPE 
and CODE_GENDER, good to know. 


To read this path now to get the number of rows and columns and also the 


precious Schema here is what you can do: 


first_ds = pq.read_table(first_pq) 
first_ds.num_rows, first_ds.num_columns, first_ds.schema 


(637800, 

17, 

ID: int64 

FLAG OWN_CAR: string 
FLAG_OWN_REALTY: string 
CNT_CHILDREN: int64 
AMT_INCOME_TOTAL: double 
NAME EDUCATION TYPE: string 
NAME_FAMILY_STATUS: string 
NAME_HOUSING_TYPE: string 
DAYS_EMPLOYED: int64 

FLAG MOBIL: bool 

FLAG WORK_PHONE: bool 

FLAG PHONE: bool 

FLAG EMAIL: bool 
OCCUPATION_TYPE: string 
CNT_FAM_MEMBERS: double 
MONTH_INCOME_TOTAL: double 
AGE: double 

-- schema metadata -- 
pandas: ‘{"index_columns": [], “column_indexes": [{"name": null, "field_n' + 2567) 


CPU times: total: 391 ms 
Wall time: 61.2 ms 


It took less than 1 second to run, the reason is that the read_table() function 
reads a Parquet file and returns a PyArrow Table object, which represents 


your data as an optimized data structure developed by Apache Arrow. 


Now, we know that there are 637800 rows and 17 columns (+2 coming from 


the path), and have an overview of the variables and their types. 


Wait, I told you before that we won't need to load anything in memory to 


discover the data. So here is a method to do it without reading any table. 


Metadata 


I am partially tricking you because we won't be loading any data but we will 


be loading something called metadata. 


In the context of the Parquet file format, metadata refers to data that 
describes the structure and characteristics of the data stored in the file. This 
includes information such as the data types of each column, the names of 


the columns, the number of rows in the table, and the schema. 


Let’s use both read_metadata() and read_schema() function from 


pyarrow. parquet: 


ts=pq.read_metadata(first_pq) 
ts.num_rows, ts.num_columns, pq.read_schema(first_pq) 


CPU times: total: @ ns 
Wall time: 6.98 ms 


This is giving you the same output as with read_table() . 


However, we notice that in the execution times there is a big difference 
because here it is close to instantaneous. And this is not a surprise because 
reading the metadata is like reading a very small part of the Parquet file that 


contains everything you need to have an overview of the data. 


Statistics 


Now let’s say I want to know a little more about the columns what can I do? 


You could read the statistics from the first Row Group of a file. 


A Row Group in the Parquet file format is a collection of rows that are stored 


together as a unit and divided into smaller chunks for efficient querying and 


processing. 


parquet_file = pq.ParquetFile(first_pq) 

ts=parquet_file.metadata.row_group(@) 

for nm in range(ts.num_coLumns) : 
print(ts.column(nm) ) 


This code above will give you an ugly output, here is some code to format it 


into a beautiful DataFrame: 


beautiful_df = pd.DataFrame() 
for nm in range(ts.num_coLumns) : 
path_in_schema = ts.column(nm) .path_in_schema 
compressed_size = ts.coLlumn(nm) .total_compressed_size 
stats = ts.column(nm).statistics 
min_value = stats.min 
max_value = stats.max 
physical_type = stats.physical_type 
beautiful_df[path_in_schema] = pd.DataFrame([physical_type, min_value, max_v 
df = beautiful_df.T 
df.columns = ['DTYPE', 'Min', 'Max', 'Compressed_Size_(KO)'] 
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DTYPE Min Max Compressed_Size_(KO) 


ID INT64 3 32762 29081 

FLAG_OWN_CAR BYTE_ARRAY N Y 708 
FLAG_OWN_REALTY BYTE_ARRAY N Y 546 
CNT_CHILDREN INT64 0 2 1072 
AMT_INCOME_TOTAL DOUBLE 31500.0 1125000.0 3424 
NAME_EDUCATION_TYPE BYTE_ARRAY Academic degree Secondary / secondary special 1226 
NAME_FAMILY_STATUS BYTE_ARRAY Civil marriage Widow 1319 
NAME_HOUSING TYPE BYTE_ARRAY Co-op apartment With parents 604 
DAYS_EMPLOYED INT64 -13735 -66 7512 
FLAG_MOBIL BOOLEAN True True 78 
FLAG_WORK_PHONE BOOLEAN False True 504 
FLAG_PHONE BOOLEAN False True 608 
FLAG_EMAIL BOOLEAN False True 424 
OCCUPATION_TYPE BYTE_ARRAY Accountants Waiters/barmen staff 2530 
CNT_FAM_MEMBERS DOUBLE 1.0 6.0 1373 
MONTH_INCOME_TOTAL DOUBLE 2625.0 93750.0 3412 
AGE DOUBLE 21.0 63.0 2968 


On the DataFrame you have the type of the columns, the minimum, the 


maximum, and the compressed size. Few learnings from this file: 
e Strings columns were converted to BYTE_ARRAY. 


e Min and Max for String columns are sorted alphabetically. 


e The compression size of Boolean is not so much better than the 
BYTE_ARRAY. 


e The youngest applicant is 21 years old, and the oldest is 68 years old. 


Becare to not generalize the statistics, it is just from the first parquet file! 


Great, now we have a good understanding of the data such as info on 


columns, types, schemas, and even statistics, but don’t we miss something? 


Partitions 


Yes, we don’t know the partitions of the data! As said before, we could guess 


at least the partitioning columns from the file path: 


"APPLICATIONS_PARTITIONED/NAME_INCOME_TYPE=Commercial associate/CODE_GENDER=F/6183f182ab0b47¢49¢f56a3e09a3a7b1-G. parquet’ 


the data is partitioned on NAME_INCOME_TYPE and CODE_GENDER. But we 
don’t know the other partition values. Suppose that we want to look at other 
NAME_INCOME_TYPE? 
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def get_all_partitions(path): 
partitions = {} 
i=0 
for (_, partitions_layer, _) in os.walk(path) : 
if len(partitions_lLayer)>0: 
key = partitions_layer[0].split('=') [0] 
partitions[key] = sorted([partitions_layer[i].split('=')[1] for i in 
else: 
break 
return partitions 
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Let’s run this function that returns a dictionary with the keys corresponding 
to the partitions columns, and values are the associated partitions values to 


each partition column. 


ps = get_all_partitions(path) 
ps.keys(), ps.values() 


We know now that the Data Engineer partitioned the data first by 
Income_Type and then by Gender. And all the values for the partition columns 


are listed below: 


(dict_keys(["NAME_INCOME_TYPE’, 'CODE_GENDER"]), 
dict_values([['Commercial associate’, ‘Pensioner’, ‘State servant’, ‘Student’, "Working'], ['F', 'M', ‘Secondary']])) 


Now that we have the partition columns and values knowledge, we can read 


another partition of our interest. 


Suppose that we want to read all the data of ‘Pensioner’ regardless of Gender. 


From the last tutorial, we know that we can do that by reading the Parquet 
folder ‘APPLICATIONS_PARTITIONED/NAME_INCOME_TYPE=Pensioner’ 


df_pensioner = pd.read_parquet('APPLICATIONS_PARTITIONED/NAME_INCOME_TYPE=Pensio 


< CD > 


Reformat the partitions 


Actually, we have no interest in splitting the data by gender and the size of 
the data allows us to read data from both genders without an excessive 


runtime. 


It is important to not over-partition your data because, in general, the 
execution time is increased by the number of partitions within the folder. So 
you have to keep in mind that there is a potential downside to partitions, 
even if it makes the data more functionally readable. (From the official 


documentation 512MB — 1GB is the optimal size for a partition). 


Here, lets say we assume that the subfolders of the genders are small 
enough after checking the data, and we find that the functional division of 
the genders is not useful. We decide to reformat the dataset to be partitioned 
only by NAME_INCOME_TYPE: 


pq_table = pq.read_table('APPLICATIONS_PARTITIONED' ) 
pq.write_to_dataset(pq_table, 'APPLICATIONS_REPARTITIONED', partition_cols=['NAM 


We just read the data in PyArrow Table object then we wrote a Parquet file 
partitioned only on NAME_INCOME_TYPE and no more per Gender. If we 


run now the function get_all_partitions() with the values: 


partitions = get_all_partitions('APPLICATIONS_REPARTITIONED' ) 
partitions. keys(), partitions.values() 


(dict_keys(["NAME_INCOME_TYPE']), 
dict_values([['Commercial associate’, ‘Pensioner’, ‘State servant’, ‘Student’, ‘Working']])) 


We notice that we don’t have the partitioning by Gender anymore. 


In conclusion, you have just seen how to navigate through Parquet files to 


know everything about the data before loading it: like column names, size, 


schema, statistics and how to get partition names and values. You also found 


out how to reformat the partitions to be more technically and functionally 


correct. 


Thanks for reading and see you in the next story! 


The full code to generate the input data we used: 


1  ## Bash command 
2 pip install kaggle 
3 kaggle datasets download -d rikdifos/credit-card-approval-prediction 
4  shutil.unpack_archive('credit-card-approval-prediction.zip' ) 
5 
6  ## Python code 
7 applications = pd.read_csv('application_record.csv' ) 
8 applications = pd.concat(10*[applications]).reset_index().drop(columns=['ID', 'index']).reset_ind 
9 applications['MONTH_INCOME_TOTAL'] = applications['AMT_INCOME_TOTAL']/12 
10 applications['AGE'] = - np.floor(applications[ 'DAYS_BIRTH']/365) 
11  my_schema = pa.Schema.from_pandas(applications) 
12  my_schema = my_schema.set(12, pa.field('FLAG_MOBIL', '‘bool')) 
13 my_schema = my_schema.set(13, pa.field('FLAG_WORK_PHONE', ‘bool')) 
14 my _schema = my_schema.set(14, pa.field('FLAG_PHONE', '‘bool')) 
15  my_schema = my_schema.set(15, pa.field('FLAG_EMAIL', 'bool')) 
16 my_schema = my_schema.remove(10) 
17 applications.to_parquet('APPLICATIONS PARTITIONED', schema = my_schema, partition_cols=[ 'NAME_IN 
< CD > 
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